Abstract

Recent work in data mining and related areas has highlighted the
importance of the statistical assessment of data mining results. Crucial
to this endeavour is the choice of a non-trivial null model for the
data, to which the found patterns can be contrasted. The most influential
null models proposed so far are defined in terms of invariants of the
null distribution. Such null models can be used by computation-intensive
randomization approaches in estimating the statistical significance of
data mining results.

Here, we introduce a methodology to construct non-trivial probabilistic
models based on the maximum entropy (MaxEnt) principle. We show how
MaxEnt models allow for the natural incorporation of prior information.
Furthermore, they satisfy a number of desirable properties of previously
introduced randomization approaches. Lastly, they also have the benefit
that they can be represented explicitly. We argue that our approach can
be used for a variety of data types. However, for concreteness, we have
chosen to demonstrate it in particular for databases and networks.



1 Introduction 



Data mining practitioners commonly have a partial understanding of
the structure of the data investigated. The goal of the data mining process 
is then to discover any additional structure or patterns the data may exhibit. 
Unfortunately, structure that is trivially implied by the prior information 
available is often overwhelming, and it is hard to design data mining algo- 
rithms that look beyond it. We believe that adequately solving this problem 
is a major challenge in current data mining research. 

For example, it should not be seen as a surprise that items known to be 
frequent in a binary database are jointly part of many transactions, as this 
is what should be expected even under a model of independence. Similarly, 
in random network theory, a densely connected community of high degree 
nodes should probably be considered less interesting than a similarly tight 
community among low degree nodes. 

Rather than discovering patterns that are implied by prior information, 
data mining is concerned with the discovery from data of departures from 
this prior information. To do this, the ability to formalize prior information 
is as crucial as the ability to contrast patterns with this information thus 
formalized. In this paper, we focus on the first of these challenges: the 
task of designing appropriate models incorporating prior information in data 
mining contexts. 

We advocate the formalization of prior information using probabilistic 
models. This enables one to assess patterns using a variety of principles 
rooted in statistics, information theory, and learning theory. It allows one to
formalize the informativeness of patterns using statistical hypothesis testing
(in which case the probabilistic model is referred to as the null model),
the minimum description length principle, and generalization arguments.

The most flexible and influential probabilistic models currently used in 
data mining research have been defined implicitly in terms of randomization 
invariants. Such invariants have been exploited with success by computa- 
tionally intensive approaches in estimating the significance (quantified by 
the p-value) of data mining results with respect to the null model they define.

Unfortunately, null models defined in terms of invariants cannot be used
to define practical measures of informativeness that can directly guide the
search of data mining algorithms towards the more interesting patterns. To be
able to do this, explicit probabilistic models that take prior information into
account are needed. Applications of models that are defined implicitly in
terms of invariants seem to be limited to post-hoc analyses of data mining
results.

Despite their potential, the development and use of explicit models is 
rare in data mining literature, in particular in the study of databases and 
networks — the main focus of this paper. Furthermore, most of the models 
that have been proposed elsewhere suffer from serious shortcomings or have 
a limited applicability (see Discussion in Sec. [5]). 

In this paper, we present a methodology for efficiently computing ex- 
plicitly representable probabilistic models for general types of data, able to 
incorporate non-trivial types of prior information. Our approach is based on 
the maximum entropy (MaxEnt) principle [8]. Although the methodology is 
general, for concreteness we focus on rectangular databases (binary, integer, 
or real-valued) as well as networks (weighted and unweighted, directed and 
undirected). We further demonstrate remarkably strong connections between 
these MaxEnt models and the aforementioned randomization approaches for 
databases and networks. 

Outline. In the remainder of the Introduction, we will first discuss the
maximum entropy principle in general (Sec. 1.1). Then we will discuss how
rectangular databases and networks are trivially represented as a matrix
(Sec. 1.2), and a way to formalize a common type of prior information for
databases and networks as constraints on that matrix (Sec. 1.3). These
results allow us to study the design of MaxEnt models for matrices in general,
quite independently of the type of data structure they represent (Sec. 2). In
Sec. 3, we relate the MaxEnt distributions to distributions defined implicitly
using swap randomizations. We then provide experiments demonstrating
the scalability of the MaxEnt modelling approach in a number of settings
(Sec. 4). In the Discussion (Sec. 5) we point out relations with literature and
implications the results in this paper may have for data mining research and
applications.

1.1 The maximum entropy principle

Consider the problem of finding a probability distribution P over the data
$x \in \mathcal{X}$ that satisfies a set of linear constraints of the form:

$$\sum_{x} P(x) f_i(x) = d_i. \tag{1}$$

Below, we will show that prior information in data mining can often be 
formalized as such. For now, let us focus on the implications of such a set of 
constraints on the shape of the probability distribution. First, note that any 
probability distribution satisfies the extra conditions 

$$\sum_{x} P(x) = 1, \qquad P(x) \geq 0.$$

In general, these constraints will not be sufficient to uniquely determine the 
distribution of the data. The most common strategy to overcome this prob- 
lem is to search for the distribution that has the largest entropy subject to 
these constraints, to which we will refer as the MaxEnt distribution. Math- 
ematically, it is found as the solution of: 

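$$\begin{aligned}
\max_{P(x)} \quad & -\sum_{x} P(x) \log P(x), \\
\text{s.t.} \quad & \sum_{x} P(x) f_i(x) = d_i \quad (\forall i), && (2) \\
& \sum_{x} P(x) = 1. && (3)
\end{aligned}$$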

Originally advocated by Jaynes [3 [S] as a generalization of Laplace's 
principle of indifference, the MaxEnt distribution can be defended in a variety 
of ways. The most common argument is that any distribution other than 
the MaxEnt distribution effectively makes additional assumptions about the 
data that reduce the entropy. As making additional assumptions biases the 
distribution in undue ways, the MaxEnt distribution is the safest bet. 

A lesser known argument, but not less convincing, is a game-theoretic 
one p!9]. Assuming that the true data distribution satisfies the given con- 
straints, it says that the compression code (e.g. Huffman) designed based on 
the MaxEnt distribution minimizes the worst-case expected coding length 
of a message coming from the true distribution. Hence, using the MaxEnt
distribution for coding purposes is optimal in a robust minimax sense.

Besides these motivations for the MaxEnt principle, it is also relatively
easy to compute a MaxEnt model. Indeed, the MaxEnt optimization problem
is convex, and can be solved conveniently using the method of Lagrange
multipliers. Let us use Lagrange multiplier $\mu$ for constraint (3) and
$\lambda_i$ for constraints (2). The Lagrangian is then equal to:

$$L(\mu, \lambda_i, P(x)) = -\sum_{x} P(x) \log P(x)
  + \sum_{i} \lambda_i \Big( \sum_{x} P(x) f_i(x) - d_i \Big)
  + \mu \Big( \sum_{x} P(x) - 1 \Big).$$

Equating the derivative with respect to P(x) to 0 yields the optimality
conditions:

$$\log P(x) = \mu - 1 + \sum_{i} \lambda_i f_i(x).$$

Hence, the MaxEnt solution belongs to the exponential family of distributions
of the form:

$$P(x) = \exp\Big( \mu - 1 + \sum_{i} \lambda_i f_i(x) \Big). \tag{4}$$

The Lagrange multipliers should be chosen such that the constraints (3)
and (2) are satisfied. These values can be found by minimizing the
Lagrangian after substituting Eq. (4) for P(x), resulting in the so-called
dual objective function. After some algebra, we find:

$$L(\mu, \lambda_i) = \sum_{x} \exp\Big( \mu - 1 + \sum_{i} \lambda_i f_i(x) \Big)
  - \mu - \sum_{i} \lambda_i d_i.$$

This is a smooth and convex function, which can be minimized efficiently
using standard techniques (see Sec. 2.2).
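
To make the derivation above concrete, the following Python sketch fits a MaxEnt distribution for a toy problem that is not taken from this paper: a six-sided die whose expected value is constrained to a given number. There is a single constraint function f(x) = x, and the normalization multiplier $\mu$ is eliminated analytically, which reduces the dual objective to $\log Z(\lambda) - \lambda d$. The function names and the use of a one-dimensional Newton iteration are illustrative assumptions.

```python
import math

def tilted_distribution(values, lam):
    """P(x) = exp(lam * x) / Z(lam), i.e. Eq. (4) with f(x) = x and exp(1 - mu) = Z(lam)."""
    weights = [math.exp(lam * x) for x in values]
    z = sum(weights)
    return [w / z for w in weights]

def fit_maxent_mean(values, target_mean, tol=1e-12, max_iter=100):
    """Minimize the reduced dual log Z(lam) - lam * target_mean with Newton's method."""
    lam = 0.0  # lam = 0 corresponds to the uniform distribution
    for _ in range(max_iter):
        probs = tilted_distribution(values, lam)
        mean = sum(p * x for p, x in zip(probs, values))
        grad = mean - target_mean          # first derivative of the reduced dual
        if abs(grad) < tol:
            break
        variance = sum(p * (x - mean) ** 2 for p, x in zip(probs, values))
        lam -= grad / variance             # the second derivative is the variance
    return lam, tilted_distribution(values, lam)

if __name__ == "__main__":
    lam, probs = fit_maxent_mean([1, 2, 3, 4, 5, 6], target_mean=4.5)
    print(lam, probs)
```

With a target mean above 3.5 the fitted multiplier is positive, so larger faces receive more probability; a target of exactly 3.5 leaves the distribution uniform.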

1.2 Matrix representations of databases and networks



In this paper, we will apply the MaxEnt principle for the purpose of inferring 
probabilistic models for rectangular databases as well as networks. Both 
these data structures can be represented as a matrix, which we will denote 
as D. This is why they are conveniently discussed within the same paper. 
Our models will be models for the random matrix D, which is directly useful 
in modelling databases as well as networks D may represent. 

For the case of databases, the matrix D has m rows and n columns, and
D(i,j) represents the element at row i and column j. For networks, the
matrix D represents the n x n adjacency matrix. It is symmetric for undirected
networks and potentially asymmetric for directed networks. For undirected
networks D(i,j) = D(j,i) contains the weight of the edge connecting nodes
i and j, whereas for directed networks D(i,j) contains the weight of the
directed edge from node i to node j. In many cases, self-loops would not be
allowed, such that D(i,i) = 0 for all i.

To maintain generality, we will assume that all matrix values belong to
some specified set $\mathcal{D} \subseteq \mathbb{R}^+$, i.e.: $D(i,j) \in \mathcal{D}$. Later we will
choose the set $\mathcal{D}$ to be the set {0, 1} (to model binary databases and
unweighted networks), the set of positive integers, or the set of positive
reals (to model integer-valued and real-valued databases and weighted
networks). Other choices can be made, and it is fairly straightforward to
adapt the derivations accordingly.

For notational simplicity, in the subsequent derivations we will assume
that $\mathcal{D}$ is discrete and countable. However, if $\mathcal{D}$ is continuous (such as
the set of positive reals) the derivations can be adapted easily by replacing
summations over $\mathcal{D}$ with integrals.

1.3 Prior information for databases and networks 

For binary databases, it has been argued that row and column sums (also 
called row and column marginals) can often be assumed as prior information. 
Any pattern that can be explained by referring to row or column sums in a 
binary database is then deemed uninteresting. Previous work has introduced 
ways to assess the significance of data mining results based on this assumption 
0, El |6]. These methods were based on the fact that the set of databases 
with fixed row and column sums is closed under so-called swaps. 

Swaps are operations that transform any 2x2 submatrix of the form
$\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$ into
$\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$, or vice versa. Clearly, such
operations leave the row and column sums invariant. Furthermore, it can be
shown that by iteratively applying swap randomizations to D, any matrix with
the same row and column sums can be obtained. Thus, randomly applying swaps
provides a suitable mechanism to sample from the set of databases with fixed
column and row sums. See Fig. 1 on the left for a graphical illustration.

Figure 1: The effect of a swap operation on a binary database (left) and on
an unweighted directed network (right).
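
As an illustration of the swap operation described above, the Python sketch below applies random swaps to a small binary matrix stored as a list of lists. It is a minimal sketch and not the implementation used in the cited work; the function names, the uniform choice of row and column pairs, and the fixed number of attempts are assumptions made for the example.

```python
import random

def try_swap(matrix, i, k, j, l):
    """Swap the 2x2 submatrix on rows i, k and columns j, l if it is
    [[1, 0], [0, 1]] or [[0, 1], [1, 0]]; return True if a swap was applied."""
    a, b = matrix[i][j], matrix[i][l]
    c, d = matrix[k][j], matrix[k][l]
    if a == d and b == c and a != b:
        matrix[i][j], matrix[i][l] = b, a
        matrix[k][j], matrix[k][l] = d, c
        return True
    return False

def swap_randomize(matrix, attempts, seed=0):
    """Return a randomized copy of a binary matrix and the number of applied swaps."""
    rng = random.Random(seed)
    result = [row[:] for row in matrix]
    m, n = len(result), len(result[0])
    applied = 0
    for _ in range(attempts):
        i, k = rng.sample(range(m), 2)   # two distinct rows
        j, l = rng.sample(range(n), 2)   # two distinct columns
        applied += try_swap(result, i, k, j, l)
    return result, applied

if __name__ == "__main__":
    data = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
    randomized, applied = swap_randomize(data, attempts=1000)
    print(randomized, applied)
```

Every applied swap moves a one within each of the two rows and each of the two columns involved, which is why the row and column sums of the result match those of the input.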

The swap operation has later been generalized to deal with real-valued 
databases as well [12]. 

Similar ideas have been developed quite independently in the context of
network analysis, and in particular in the search for recurring network motifs
[10]. There, swaps were applied to the adjacency matrix, corresponding to a
rewiring of the kind depicted on the right in Fig. 1. Randomized networks
obtained using a chain of such edge swaps were used to statistically assess
the significance of particular recurring network motifs in biological networks.

Note that the sum of row i of an adjacency matrix D corresponds to
the out-degree of node i, whereas the sum of column j corresponds to the
in-degree of node j. Clearly, swaps of this kind leave the in-degree and
out-degree of all nodes invariant. That is, by using swap operations on
networks, one can sample from the set of all networks with given in-degree
and out-degree sequences.

In summary, whether D represents a database or a network, the invariants
amount to constraints on its row and column sums. The models we will develop
in this paper are based on exactly these invariants, albeit in a somewhat
relaxed form: we will assume that the expected values of the row and column
sums are equal to specified values. Mathematically, this can be expressed as:

$$\sum_{D} P(D) \Big( \sum_{j} D(i,j) \Big) = d_i^r \quad (\forall i),$$
$$\sum_{D} P(D) \Big( \sum_{i} D(i,j) \Big) = d_j^c \quad (\forall j),$$

where $d_i^r$ is the i-th expected row sum and $d_j^c$ the j-th expected column
sum. Although they have been developed for binary databases [5] and unweighted
networks [10], and later extended to real-valued databases [12], we will
explore the consequences of these constraints in broader generality, for
various types of databases and weighted networks.

Importantly, it is easy to verify that these constraints are exactly of the 
type of Eq. ([1]), such that the MaxEnt formalism is directly applicable. 



2 MaxEnt Distributions with Given Expected 
Row and Column Sums 



We have now pointed out that databases as well as networks can be repre- 
sented using a matrix over a set V of possible values. Furthermore, commonly 
used constraints on models for databases as well as for networks amount to 
constraining the column and row sums of this matrix to be constant. Thus, 
we can first discuss MaxEnt models for m x n matrices D in general, subject
to row and column sum constraints. Then we will point out particularities 
and adjustments to be made for these models to be applicable for matrices 
representing databases or networks. 

The MaxEnt distribution subject to constraints on the expected row and
column sums is found by solving:

$$\begin{aligned}
\max_{P(D)} \quad & -\sum_{D} P(D) \log P(D), \\
\text{s.t.} \quad & \sum_{D} P(D) \Big( \sum_{j} D(i,j) \Big) = d_i^r \quad (\forall i), && (5) \\
& \sum_{D} P(D) \Big( \sum_{i} D(i,j) \Big) = d_j^c \quad (\forall j), && (6) \\
& \sum_{D} P(D) = 1. && (7)
\end{aligned}$$

The resulting distribution will belong to the exponential family, and will
be of the form of Eq. (4):

$$\begin{aligned}
P(D) &= \exp\Big( \mu - 1 + \sum_{i} \lambda_i^r \sum_{j} D(i,j)
        + \sum_{j} \lambda_j^c \sum_{i} D(i,j) \Big) \\
     &= \exp(\mu - 1) \prod_{i,j} \exp\big( D(i,j) (\lambda_i^r + \lambda_j^c) \big), && (8)
\end{aligned}$$

where $\mu$ is the Lagrange multiplier for constraint (7), $\lambda_i^r$ are the
Lagrange multipliers for constraints (5), and $\lambda_j^c$ for constraints (6).

The first factor in this expression is a normalization constant, the value
of which can be determined using constraint (7):

$$\begin{aligned}
\exp(1 - \mu) &= \sum_{D} \prod_{i,j} \exp\big( D(i,j) (\lambda_i^r + \lambda_j^c) \big) \\
              &= \prod_{i,j} \sum_{D(i,j) \in \mathcal{D}} \exp\big( D(i,j) (\lambda_i^r + \lambda_j^c) \big) \\
              &= \prod_{i,j} Z(\lambda_i^r, \lambda_j^c),
\end{aligned}$$

where $Z(\lambda_i^r, \lambda_j^c) = \sum_{D(i,j) \in \mathcal{D}} \exp\big( D(i,j) (\lambda_i^r + \lambda_j^c) \big)$.
Plugging this into Equation (8) yields:

$$P(D) = \prod_{i,j} \frac{1}{Z(\lambda_i^r, \lambda_j^c)} \exp\big( D(i,j) (\lambda_i^r + \lambda_j^c) \big)
       = \prod_{i,j} P(D(i,j)),$$

where

$$P(D(i,j)) = \frac{1}{Z(\lambda_i^r, \lambda_j^c)} \exp\big( D(i,j) (\lambda_i^r + \lambda_j^c) \big) \tag{9}$$

is a properly normalized probability distribution for the matrix element
D(i,j) at row i and column j. Hence, the MaxEnt model factorizes as a
product of independent distributions for the matrix elements. It is important
to stress that we did not impose independence to start with. The independence
is a consequence of the MaxEnt objective.

Various choices for $\mathcal{D}$ will lead to various distributions, with appropriate
values for the normalization constant $Z(\lambda_i^r, \lambda_j^c)$. For $\mathcal{D}$ binary, the
MaxEnt distribution for D is a product of independent Bernoulli distributions
with probability of success equal to
$\exp(\lambda_i^r + \lambda_j^c) / (1 + \exp(\lambda_i^r + \lambda_j^c))$ for D(i,j). For $\mathcal{D}$ the set
of positive integers, the distribution is a product of independent geometric
distributions with success probability equal to $\exp(\lambda_i^r + \lambda_j^c)$ for D(i,j). And
for $\mathcal{D}$ the set of positive reals, the distribution is a product of independent
exponential distributions with rate parameter equal to $-(\lambda_i^r + \lambda_j^c)$ for D(i,j).
This is summarized in Table 1.

Table 1: Three possible domains for the elements of D, the corresponding
normalization factors in the MaxEnt distribution P(D(i,j)) for the matrix
element D(i,j), and the resulting type of distribution for the matrix elements.

  $\mathcal{D}$      $1/Z(\lambda_i^r, \lambda_j^c)$                  Distribution
  $\{0, 1\}$         $1 / (1 + \exp(\lambda_i^r + \lambda_j^c))$      Bernoulli
  $\mathbb{N}$       $1 - \exp(\lambda_i^r + \lambda_j^c)$            Geometric
  $\mathbb{R}^+$     $-(\lambda_i^r + \lambda_j^c)$                   Exponential
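
The entries of Table 1 can be checked numerically. The Python sketch below evaluates the inverse normalization factor and the expected value of a single matrix element for each of the three domains, with `lam` standing for the sum $\lambda_i^r + \lambda_j^c$. The function names and the string labels for the domains are hypothetical. The integer case assumes that the sum defining Z starts at 0, which is what the closed form $1 - \exp(\lambda_i^r + \lambda_j^c)$ in Table 1 corresponds to.

```python
import math

def inverse_normalizer(lam, domain):
    """1 / Z for one matrix element, with lam = lam_r[i] + lam_c[j] (see Table 1)."""
    if domain == "binary":      # values {0, 1}
        return 1.0 / (1.0 + math.exp(lam))
    if lam >= 0:
        raise ValueError("lam must be negative for an unbounded domain")
    if domain == "integer":     # values 0, 1, 2, ... (assumed to start at 0)
        return 1.0 - math.exp(lam)
    if domain == "real":        # values in [0, infinity)
        return -lam
    raise ValueError("unknown domain: " + domain)

def expected_value(lam, domain):
    """Expected value of one element under P(v) = exp(v * lam) / Z."""
    if domain == "binary":      # Bernoulli success probability
        return math.exp(lam) / (1.0 + math.exp(lam))
    if lam >= 0:
        raise ValueError("lam must be negative for an unbounded domain")
    if domain == "integer":     # mean of the geometric distribution
        return math.exp(lam) / (1.0 - math.exp(lam))
    if domain == "real":        # mean of the exponential distribution with rate -lam
        return -1.0 / lam
    raise ValueError("unknown domain: " + domain)

if __name__ == "__main__":
    for domain in ("binary", "integer", "real"):
        print(domain, inverse_normalizer(-0.5, domain), expected_value(-0.5, domain))
```

For the two unbounded domains the multiplier sum must be negative, since otherwise the normalization factor diverges.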

2.1 MaxEnt models for databases and networks

The MaxEnt matrix approach is directly applicable for modelling databases
of size m x n, and no further modifications are required.

Similarly, for directed networks, given the in-degrees and out-degrees, a
MaxEnt model can be found by using the approach outlined above.

Small modifications are needed for the case of undirected networks, where
D(i,j) = D(j,i). From symmetry, it follows that at the optimum
$\lambda_i^r = \lambda_i^c$, and we can omit the superscripts r and c, such that the
solution looks like:

$$P(D) = \prod_{i,j} P(D(i,j)),$$
$$P(D(i,j)) = \frac{1}{Z(\lambda_i, \lambda_j)} \exp\big( D(i,j) (\lambda_i + \lambda_j) \big),$$

where $Z(\lambda_i, \lambda_j) = \sum_{D(i,j) \in \mathcal{D}} \exp\big( D(i,j) (\lambda_i + \lambda_j) \big)$.

Another small modification is needed for networks where self-loops are
not allowed, such that D(i,i) = 0. For $\mathcal{D} \subseteq \mathbb{R}^+$ these constraints can be
enforced quite easily by requiring $\sum_{D} P(D) D(i,i) = 0$ for all i, which are
again constraints of the form of Eq. (1). The resulting optimal distributions
are identical in shape to the model with self-loops, apart from the fact that
self-loops receive zero probability and the Lagrange multipliers would have
slightly different values at the optimum.

2.2 Optimizing the Lagrange multipliers 

We have now derived the shape of the models P(D), expressed in terms of
the Lagrange multipliers (also known as the dual variables), but we have not
yet discussed how to compute the values of these Lagrange multipliers at the
MaxEnt optimum.

In Sec. 1.1 we have briefly outlined the general strategy to do this, as
dictated by the theory of convex optimization [2]: the solution for P(D) in
terms of the Lagrange multipliers should be substituted into the Lagrangian
of the MaxEnt optimization problem. The result is a smooth and convex
function of the Lagrange multipliers, and minimizing it with respect to the
Lagrange multipliers yields the optimal values.

Let us investigate what this means for the case of general m x n matrices.
The number of constraints and hence the number of Lagrange multipliers
is equal to m + n + 1, which is sublinear in the size of the data mn. The
optimal values of the parameters are found easily using standard methods for
unconstrained convex optimization such as Newton's method or (conjugate)
gradient descent, possibly with a preconditioner [15, 2]. We will report results
for two possible choices in the Experiments Section.

In certain cases, the computational and space complexity can be further
reduced, in particular when the number of distinct values of $d_i^r$ and of
$d_j^c$ is small. Indeed, if $d_i^r = d_k^r$ for specific i and k, the corresponding
Lagrange multipliers $\lambda_i^r$ and $\lambda_k^r$ will be equal as well, reducing the
number of free parameters.

Especially for $\mathcal{D} = \{0, 1\}$, this situation is the rule rather than the
exception. Indeed, when the $d_i^r$ and $d_j^c$ are computed based on a given
database with m rows and n columns, it is readily observed that both the
number of distinct row sums $d_i^r$ and column sums $d_j^c$ are upper bounded by
min(m, n), significantly reducing the complexity when
$\min(m, n) \ll \max(m, n)$. Furthermore, the number of different nonzero row
sums as well as the number of different nonzero column sums in a sparse
matrix with s nonzero elements is at most $\sqrt{2s}$ (k distinct nonzero sums
add up to at least k(k+1)/2, which cannot exceed s). Hence, the number of
dual variables is upper bounded by $2 \min(m, n, \sqrt{2s})$. Furthermore, this
bound can be sharpened to $d^r_{\max} + d^c_{\max}$, with $d^r_{\max}$ and $d^c_{\max}$ upper
bounds on the row and column sums. For $\mathcal{D}$ the set of integers, similar
bounds can be obtained.
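
The bound on the number of distinct nonzero sums is easy to check on an example. The Python sketch below counts the distinct nonzero row and column sums of a small binary matrix and compares them with $\sqrt{2s}$; the function name and the example matrix are made up for illustration.

```python
import math

def distinct_nonzero_sums(matrix):
    """Return the numbers of distinct nonzero row sums and column sums."""
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(col) for col in zip(*matrix)]
    distinct_rows = {value for value in row_sums if value != 0}
    distinct_cols = {value for value in col_sums if value != 0}
    return len(distinct_rows), len(distinct_cols)

if __name__ == "__main__":
    data = [
        [1, 1, 1, 0],
        [1, 1, 0, 0],
        [1, 0, 0, 0],
        [1, 1, 0, 0],
    ]
    s = sum(sum(row) for row in data)       # number of nonzero elements
    n_rows, n_cols = distinct_nonzero_sums(data)
    print(n_rows, n_cols, math.sqrt(2 * s))
```

Rows or columns that share a sum can share a single Lagrange multiplier, so these counts (plus one for a possible zero sum) give the number of free parameters per side.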

At first sight, it may seem to be a concern that the MaxEnt model is a
product distribution of independent distributions for each D(i,j). However,
it should be pointed out that one does not need to store the value of
$\lambda_i^r + \lambda_j^c$ for each pair of i and j. Rather, it suffices to store just the
$\lambda_i^r$ and $\lambda_j^c$ to compute the probabilities for any D(i,j) in constant time.
Hence, also the space required to store the resulting model is O(m + n),
sublinear in the size of the data.
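
As a rough illustration of the optimization described in this section, the Python sketch below fits the binary model by plain gradient descent on the dual. It is not the code used in our experiments: the function names, the fixed number of iterations, and the constant step size are assumptions. The normalization multiplier $\mu$ is eliminated, so the dual becomes $\sum_{i,j} \log Z(\lambda_i^r, \lambda_j^c) - \sum_i \lambda_i^r d_i^r - \sum_j \lambda_j^c d_j^c$, whose gradient is the difference between the expected and the target row and column sums.

```python
import math

def element_probability(lam_r, lam_c, i, j):
    """P(D(i,j) = 1) for the binary model, computed in constant time."""
    return 1.0 / (1.0 + math.exp(-(lam_r[i] + lam_c[j])))

def fit_binary_maxent(row_sums, col_sums, iterations=2000):
    """Gradient descent on the dual of the binary MaxEnt model."""
    m, n = len(row_sums), len(col_sums)
    lam_r = [0.0] * m
    lam_c = [0.0] * n
    # Conservative constant step: the curvature of the dual is at most (m + n) / 4.
    step = 4.0 / (m + n)
    for _ in range(iterations):
        prob = [[element_probability(lam_r, lam_c, i, j) for j in range(n)]
                for i in range(m)]
        # Gradient = expected sums under the current model minus target sums.
        grad_r = [sum(prob[i]) - row_sums[i] for i in range(m)]
        grad_c = [sum(prob[i][j] for i in range(m)) - col_sums[j] for j in range(n)]
        lam_r = [lam - step * g for lam, g in zip(lam_r, grad_r)]
        lam_c = [lam - step * g for lam, g in zip(lam_c, grad_c)]
    return lam_r, lam_c

if __name__ == "__main__":
    lam_r, lam_c = fit_binary_maxent([2, 1, 2], [2, 2, 1])
    print(lam_r, lam_c)
    print(element_probability(lam_r, lam_c, 0, 0))
```

Only the m + n multipliers are stored; any element probability is recomputed on demand. Targets equal to 0 or to the full row or column length would push the corresponding multiplier towards infinity, so this sketch assumes that all targets lie strictly in between.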

For each of the models discussed in this paper we will make the code 
freely available. 

3 The Invariance of MaxEnt Matrix Distributions to δ-Swaps

We have motivated the use of constraints on the expected row and column
sums by relying on previous work where row and column sums of a database
or a network adjacency matrix were argued to be reasonable prior information
a data mining practitioner may have about the problem. In this prior work,
the authors devised ways to generate new random data satisfying this prior
information, by randomizing the given database or network using swaps.
These swaps allow one to sample from the uniform distribution over all
binary databases with given row and column sums [5], or from all networks
with given in-degrees and out-degrees [10]. Later, the swap operation was
generalized to allow randomizing a real-valued data matrix as well [12].

We believe the MaxEnt models introduced in this paper are most inter- 
esting in their own right, being explicitly represented, easy to compute, and 
easy to sample random databases or network adjacency matrices from. Still, 
it is instructive to point out some relations between them and the previously 
proposed swap operations. 

3.1 δ-swaps: a randomization operation on matrices

First, let us generalize the definition of a swap as follows.

Definition 3.1 (δ-swap). Given an m x n matrix D, a δ-swap for rows i, k
and columns j, l is the operation that adds a fixed number δ to D(i,j) and
D(k,l) and subtracts the same number from D(i,l) and D(k,j).

Of course, for a δ-swap to be useful, it must be ensured that
$D(i,j) + \delta,\ D(k,j) - \delta,\ D(i,l) - \delta,\ D(k,l) + \delta \in \mathcal{D}$. We will refer to such
δ-swaps as allowed δ-swaps.

Definition 3.2 (Allowed δ-swap). A δ-swap for rows i, k and columns j, l
is said to be allowed for a given matrix D over the domain $\mathcal{D}$ iff
$D(i,j) + \delta,\ D(k,j) - \delta,\ D(i,l) - \delta,\ D(k,l) + \delta \in \mathcal{D}$.

Clearly, an allowed δ-swap leaves the row and column sums invariant.
The following Theorem is more interesting.

Theorem 3.1. The probability of a matrix D under the MaxEnt distribution
subject to equality constraints on the expected row and column sums is
invariant under allowed δ-swaps applied to D.

Indeed, it is easily verified from Eq. (9) that:

$$\begin{aligned}
& P(D(i,j)) \cdot P(D(i,l)) \cdot P(D(k,j)) \cdot P(D(k,l)) \\
&\quad = P(D(i,j) + \delta) \cdot P(D(i,l) - \delta) \cdot P(D(k,j) - \delta) \cdot P(D(k,l) + \delta)
\end{aligned}$$

for any δ, rows i, k and columns j, l.

This means that for any 2x2 submatrix of D, adding a given number to
its diagonal and subtracting the same number from its off-diagonal elements
leaves the total probability of the data invariant.

More generally, the MaxEnt distribution assigns the same probability to
any two matrices that have the same row and column sums. This can be
seen from the fact that Eq. (8) is independent of D as soon as the row
and column sums $\sum_{j} D(i,j)$ and $\sum_{i} D(i,j)$ are given. In statistical terms:
the row and column sums are sufficient statistics of the data D. We can
formalize this in the following Theorem:

Theorem 3.2. The MaxEnt distribution for a matrix D, conditioned on
constraints on row and column sums of the form

$$\sum_{j} D(i,j) = d_i^r \quad (\forall i), \qquad \sum_{i} D(i,j) = d_j^c \quad (\forall j),$$

denoted as $P\big( D \mid \sum_{j} D(i,j) = d_i^r,\ \sum_{i} D(i,j) = d_j^c \big)$, is identical to the
uniform distribution over all databases satisfying these constraints.

This Theorem further strengthens the connection between the uniform
distribution over all matrices with fixed row and column sums, as sampled
from in [5, 10, 12] using swap randomizations, and the MaxEnt distribution.
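
Theorem 3.1 can be verified numerically on a small example. The Python sketch below evaluates the logarithm of the unnormalized probability from Eq. (8), that is, the sum of $D(i,j)(\lambda_i^r + \lambda_j^c)$ over all elements, before and after a δ-swap. The helper names and the particular multiplier values are arbitrary choices made for the illustration.

```python
def log_weight(matrix, lam_r, lam_c):
    """log P(D) up to the additive constant mu - 1, following Eq. (8)."""
    return sum(matrix[i][j] * (lam_r[i] + lam_c[j])
               for i in range(len(matrix))
               for j in range(len(matrix[0])))

def delta_swap(matrix, i, k, j, l, delta):
    """Return a copy of the matrix after a delta-swap on rows i, k and columns j, l."""
    result = [row[:] for row in matrix]
    result[i][j] += delta
    result[k][l] += delta
    result[i][l] -= delta
    result[k][j] -= delta
    return result

if __name__ == "__main__":
    data = [[3, 5, 0], [4, 1, 2], [0, 6, 7]]
    lam_r = [-0.3, -1.2, -0.7]
    lam_c = [-0.1, -0.4, -2.0]
    swapped = delta_swap(data, 0, 1, 0, 1, delta=-1)
    print(log_weight(data, lam_r, lam_c), log_weight(swapped, lam_r, lam_c))
```

The two printed values agree up to floating-point rounding, because the swap changes the exponent by δ times a combination of multipliers that cancels exactly.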



3.2 δ-swaps in databases and networks

The invariants that have been used in computation-intensive approaches for
defining null models for databases and networks are special cases of these
more generally applicable δ-swaps.

For binary databases the condition
$D(i,j) + \delta,\ D(k,j) - \delta,\ D(i,l) - \delta,\ D(k,l) + \delta \in \mathcal{D}$ corresponds to the fact that
either $\delta = -1$ and $D(i,k;j,l) = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$,
or $\delta = 1$ and $D(i,k;j,l) = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$, where D(i,k;j,l)
denotes the 2x2 submatrix on rows i, k and columns j, l. Thus, a δ-swap is
identical to a swap in a binary database. This shows that the MaxEnt
distribution of a binary database is invariant under swaps as defined in [5].
For positive real-valued databases, the δ-swap operations reduce to the
Addition Mask method in [12].

Similarly, the edge swap operations on networks can be understood as
δ-swaps with δ equal to 1 or -1. For networks in which self-loops are forbidden,
swaps with rows i, k and columns j, l must satisfy $\{i, k\} \cap \{j, l\} = \emptyset$, such
that no self-loops are created. For undirected networks, a δ-swap operation
with rows i, k and columns j, l should always be accompanied by a symmetric
δ-swap with rows j, l and columns i, k, in order to preserve symmetry.

Besides these special cases, allowed δ-swaps, found here as simple invariants
of the MaxEnt distribution, can be used for randomizing any of the types
of databases or networks discussed in this paper. This being said, it should be
reiterated that the availability of the MaxEnt distribution should make
randomizing the data using δ-swaps unnecessary. Instead one can simply sample
directly from the MaxEnt distribution, thus avoiding the computational cost
and potential convergence problems faced in randomizing the data. An
exception would be if it is crucial that the row and column sums are preserved
exactly rather than in expectation.
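
Direct sampling from the MaxEnt distribution is straightforward for the binary case, because the model is a product of independent Bernoulli distributions. The Python sketch below draws one binary matrix given multipliers that are assumed to have been fitted already; the function name and the example multiplier values are hypothetical.

```python
import math
import random

def sample_binary_matrix(lam_r, lam_c, rng):
    """Draw one binary matrix: each element is an independent Bernoulli variable
    with success probability exp(lam) / (1 + exp(lam)), lam = lam_r[i] + lam_c[j]."""
    return [[1 if rng.random() < 1.0 / (1.0 + math.exp(-(a + b))) else 0
             for b in lam_c]
            for a in lam_r]

if __name__ == "__main__":
    rng = random.Random(0)
    lam_r = [0.5, -1.0]          # made-up multipliers, for illustration only
    lam_c = [0.2, -0.3, 1.0]
    print(sample_binary_matrix(lam_r, lam_c, rng))
```

Each sample costs one random number per matrix element and requires no burn-in, in contrast to a chain of swaps. The row and column sums of a single sample match the targets only in expectation.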

4 Experiments 

In this Section we present experiments that illustrate the computational cost 
of the MaxEnt modelling approach on a number of problems. All experiments 
were done on a 2GHz Pentium Centrino with 1GB Memory. 

4.1 Modelling binary databases 

We report empirical results on four databases: two textual datasets, turned 
into databases by considering words as items and documents as transactions, 
and two other databases commonly used for evaluation purposes. 

ICDM: All ICDM abstracts until 2007, where each abstract is represented
as a transaction and words are items. Stop words have been removed,
and stemming performed.

Mushroom: A publicly available item-transaction dataset [1].

Pubmed: All Pubmed abstracts retrieved by querying with the search query
"data mining", after stop word removal and stemming.

Retail: A dataset about transactions in a Belgian supermarket store, where
transactions and items have been anonymized [3].

Some statistics are gathered in Table 2. The Table also mentions support
thresholds used for some of the experiments reported below, and the numbers
of closed itemsets satisfying these support thresholds.

Table 2: Some statistics for the databases investigated.

            # items   # tids   support   # closeds
ICDM          4,976      859         5     365,732
Mushroom        120    8,124       812       6,298
Pubmed       12,661    1,683        10   1,249,913
Retail       16,470   88,162         8     191,088

Fitting the MaxEnt model. The method we used to fit the model is a
preconditioned gradient descent method with Jacobi preconditioner (see e.g.
[15]), implemented in C++. It is quite conceivable that more sophisticated
methods will lead to significant further speedups, but this one is particularly
easy to implement.

To illustrate the speed of computing the MaxEnt distribution, Fig. 2 shows
plots of the convergence of the squared norm of the gradient to zero, for
the first 20 iterations. The initial value for all dual variables was chosen
to be equal to 0. Noting the logarithmic vertical axis, the convergence
appears clearly exponential. The lower plot in Fig. 2 shows the convergence
of the dual objective to its minimum over the iterations, clearly a very fast
convergence in just a few iterations.

In all our experiments we stopped the iterations as soon as this normalized 
squared norm became smaller than 10~^^, which is close to machine accuracy 
and accurate enough for all practical purposes. The number of iterations 
required and the overall computation time are summarized in Table 3.

Table 3: The number of iterations required, and the computation time in
seconds to fit the probabilistic model to the data.

            # iterations   time (s)
ICDM                  13       0.35
Mushroom              37       0.02
Pubmed                15       1.24
Retail                18       2.80

Figure 2: Top: the squared norm of the gradient on a logarithmic scale as a
function of the iteration number, plotted for four databases: ICDM abstracts,
Mushroom, Pubmed abstracts, and Retail. This plot shows the exponential
decrease of the gradient of the dual optimization problem. In the second
plot, the convergence of the Lagrange dual is shown for the same databases.

Assessing data mining results. Here we illustrate the use of the MaxEnt
model for assessing data mining results in the same spirit as [5]. Figure 3
plots the number of closed itemsets retrieved on the original data.
Additionally, it shows box plots for the results obtained on randomly sampled
databases from the MaxEnt model with expected row sums and column sums
constrained to be equal to their values on the original data. If desired, one
could extract one global measure from these results, as in [5], and compute an
empirical p-value by comparing that measure obtained on the actual data with
the result on the randomized versions. However, the plots given here do not
force one to make such a choice, and they still give a good idea of the
patterns contained in the datasets.

Figure 3: For the four datasets under investigation (panels: ICDM, Mushroom,
Pubmed, Retail), these plots show the number of closed itemsets on a
logarithmic scale, as a function of their size. Additionally, box plots are
shown for the number of closed itemsets as a function of size found on 100
randomized datasets, based on the MaxEnt distribution.
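
For readers who do want a single number, an empirical p-value of the kind mentioned above can be computed as in the following Python sketch. The function name is hypothetical, and the "+1" correction, which counts the observed value as one of the samples, is one common convention rather than necessarily the one used in [5]. The sketch assumes that larger values of the global measure are more extreme.

```python
def empirical_p_value(observed, randomized):
    """One-sided empirical p-value of an observed statistic against the
    statistics computed on randomized datasets (larger means more extreme)."""
    at_least_as_large = sum(1 for value in randomized if value >= observed)
    return (at_least_as_large + 1) / (len(randomized) + 1)

if __name__ == "__main__":
    # Made-up numbers: statistic on the real data and on three random samples.
    print(empirical_p_value(10, [1, 2, 3]))
```

With 100 randomized datasets, as in Figure 3, the smallest p-value this estimator can return is 1/101.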

Note that the possible applications of the MaxEnt model will likely reach 
further than the assessment of data mining results. However, this is beyond 
the scope of the current paper, and we will get back to this in the Discussion 
Section. 

4.2 Modelling networks 

To evaluate the MaxEnt model for networks, we artificially generated power- 
law (weighted) degree distributions for networks of various sizes between 
n = 10 and n = 10^ nodes, with a power-law exponent of 2.5. I.e., for each n 
we sampled n expected (weighted) degrees $d_i$ from the distribution
$P(d_i) \propto d_i^{-2.5}$. A power-law degree distribution with this exponent is
often observed in realistic networks [11], so we believe this is a
representative set of examples. In addition, non-reported experiments on other
realistic types of networks yield qualitatively similar results. For each of
these degree distributions, we fitted
four different types of undirected networks: unweighted networks with and 
without self-loops, and positive integer-valued weighted networks with and 
without self-loops. 
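
The text does not state how the degrees were drawn, so the following is only a sketch of one common way to sample from a power law with exponent 2.5: inverse transform sampling from a continuous power law. The minimum degree `d_min`, the function name and the absence of an upper cutoff are all assumptions.

```python
import random


def sample_power_law_degrees(n, exponent=2.5, d_min=1.0, seed=0):
    """Draw n expected degrees with density proportional to d**(-exponent) for d >= d_min.

    Inverse transform sampling: if u is uniform on [0, 1), then
    d_min * (1 - u)**(-1 / (exponent - 1)) follows the continuous power law.
    """
    rng = random.Random(seed)
    power = -1.0 / (exponent - 1.0)
    return [d_min * (1.0 - rng.random()) ** power for _ in range(n)]


degrees = sample_power_law_degrees(1000)
```

Rounding the sampled values to integers would be one way to obtain degree sequences with many repeated values.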

To fit the MaxEnt models for networks we made use of Newton's method, 
which we implemented in MATLAB. As can be seen from Fig. 4, the computation
time was under 30 seconds even for the largest network with 10^
nodes. The number of Newton iterations is lower than 50 for all models and 
degree distributions considered. 

This fast performance can be achieved thanks to the fact that the number 
of different degrees observed in the degree distribution is typically much 
smaller than the size of the network (see Discussion in Sec. 12.21). The bottom
graph in Fig. 4, showing the number of Lagrange multipliers as a function of
the network size, supports this. The memory requirements remain well under
control for the same reasons. 
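
To make the role of repeated degrees concrete, here is a minimal sketch of a Newton fit that keeps one Lagrange multiplier per distinct degree value. It is written in Python with NumPy rather than MATLAB and is not the author's implementation. It assumes the unweighted, undirected model without self-loops, with edge probability exp(l_i + l_j) / (1 + exp(l_i + l_j)) for nodes i and j; the function names and the step-halving safeguard are assumptions as well.

```python
import numpy as np


def expected_degrees(lam, counts):
    """Expected degree of a node in each group, plus the group-to-group edge probabilities."""
    prob = 1.0 / (1.0 + np.exp(-(lam[:, None] + lam[None, :])))
    # Sum over all nodes of every group, minus the excluded self-loop term.
    return prob @ counts - np.diag(prob), prob


def fit_maxent_network(degrees, tol=1e-9, max_iter=50):
    """Newton iterations with one multiplier per distinct degree value."""
    values, counts = np.unique(np.asarray(degrees, dtype=float), return_counts=True)
    counts = counts.astype(float)
    lam = np.zeros(len(values))
    for _ in range(max_iter):
        expected, prob = expected_degrees(lam, counts)
        residual = expected - values
        if np.max(np.abs(residual)) < tol:
            break
        slope = prob * (1.0 - prob)                 # derivative of the logistic function
        jac = slope * counts[None, :]               # d expected_k / d lam_l
        jac += np.diag(slope @ counts - 2.0 * np.diag(slope))
        step = np.linalg.solve(jac, -residual)
        t = 1.0
        while t > 1e-6:                             # halve the step until the residual shrinks
            trial, _ = expected_degrees(lam + t * step, counts)
            if np.linalg.norm(trial - values) < np.linalg.norm(residual):
                break
            t *= 0.5
        lam = lam + t * step
    return values, counts, lam


# Six nodes but only two distinct degrees, hence only two multipliers.
values, counts, lam = fit_maxent_network([3, 3, 2, 2, 2, 2])
```

The linear system solved in each iteration has one row per distinct degree, not one per node, which is why the cost depends on the number of Lagrange multipliers and not directly on n.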

It should be pointed out that in the worst case for dense or for weighted 
networks (and in particular for real-valued weights), the number of distinct
expected weighted degrees and hence the number of Lagrange multipliers can
be as large as the number of nodes n. This would make it much harder to 
use off-the-shelf optimization tools for n much larger than 1000. However, 
the problem can be made tractable again if it is acceptable to approximate 
the expected weighted degrees by grouping subsets of them together into 
bins, and replacing their values by a bin average. In this way the number 
of Lagrange multipliers can be brought below 1000. We postpone a full 
discussion of this and other strategies to a later paper. 
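
The binning idea can be sketched as follows. The paper leaves the choice of bins open, so the equal-count bins over the sorted degrees used here, and the function name, are assumptions.

```python
def bin_degrees(degrees, num_bins):
    """Replace each degree by the average of its bin.

    Bins are formed from consecutive degrees in sorted order, with (almost)
    equal numbers of nodes per bin, so at most num_bins distinct values remain.
    """
    order = sorted(range(len(degrees)), key=lambda i: degrees[i])
    size = -(-len(degrees) // num_bins)  # ceiling division: nodes per bin
    binned = [0.0] * len(degrees)
    for start in range(0, len(order), size):
        members = order[start:start + size]
        average = sum(degrees[i] for i in members) / len(members)
        for i in members:
            binned[i] = average
    return binned


binned = bin_degrees([1.0, 4.0, 2.0, 9.0], num_bins=2)
```

Because each value is replaced by its bin average, the total expected weight of the network is left unchanged by the approximation.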

5 Discussion 

We will first discuss how some existing explicit models are related to par- 
ticular cases of the MaxEnt models introduced in this paper. Then, we will 
discuss the implications for data mining of the availability of explicit models.

5.1 Connections with literature 

Interestingly, the MaxEnt model for binary matrices introduced in this paper 
is formally identical to the Rasch model, known from psychometrics [13].
This model was introduced to model the performance of individuals (rows)
on questions (columns). The matrix elements indicate which questions were
answered correctly or incorrectly by each individual. The Lagrange dual
variables are interpreted as persons' abilities for the row variables $\lambda^r_i$, and
questions' difficulties for the column variables $\lambda^c_j$. Remarkably, the model was not derived from the
MaxEnt principle but stated directly. 
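
For reference, the correspondence can be written out as below. This is a sketch: the Rasch symbols $\beta_i$ (ability) and $\delta_j$ (difficulty), and the sign convention linking them to the Lagrange multipliers, are assumptions chosen to match the usual statement of the Rasch model.

```latex
% MaxEnt model for a binary matrix D with constrained expected row and column sums
P(\mathbf{D}(i,j) = 1) = \frac{\exp(\lambda^r_i + \lambda^c_j)}{1 + \exp(\lambda^r_i + \lambda^c_j)}

% Rasch model: person i with ability \beta_i answers question j of difficulty \delta_j correctly
P(X_{ij} = 1) = \frac{\exp(\beta_i - \delta_j)}{1 + \exp(\beta_i - \delta_j)}

% The two expressions coincide for \lambda^r_i = \beta_i and \lambda^c_j = -\delta_j
```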

A similar connection exists with the so-called p* models from social network
analysis [Tl|. Although motivated differently, the $p_1$ model in particular
is formally identical to our MaxEnt model for unweighted networks.

Thus, the present paper provides an additional way to look at these widely 
used models from psychometrics and social network analysis. Furthermore, as 
we have shown, the MaxEnt approach suggests generalizations, in particular 
towards non-binary databases and weighted networks. 

Figure 4: Top: the computation time as a function of the number of
nodes in the network. A marker x is used for the unweighted model
with self-loops, o for the unweighted model without self-loops, V for the
weighted model with self-loops, and □ for the weighted model without
self-loops. Note the log-log scale. Middle: the number of iterations required by
the Newton algorithm before convergence. Note the log scale on the
horizontal axis. Bottom: the number of Lagrange multipliers (i.e. the number
of variables in the dual of the MaxEnt optimization problem) for the degree
sequences investigated, as a function of the network size. Again, note the
log scale on the horizontal axis.

Another connection is to prior work on random network models for
networks with prescribed degree sequences (see [H] and references therein). The
model most similar to the ones discussed in this paper is the one from P|. In
that paper, the authors propose to assume that edge occurrences are
independent, with each edge probability proportional to the product of the degrees
of the pair of nodes considered. In the notation of the present paper:



$$P(\mathbf{D}) = \prod_{i,j} P(\mathbf{D}(i,j)) \quad \text{with} \quad P(\mathbf{D}(i,j)) = \frac{d_i d_j}{s},$$

where $s = \sum_i d_i$. Also for this model the constraints on the expected row
and column sums are satisfied. 

It would be too easy to simply dismiss this model by stating that among 
all distributions satisfying the expected row and column sum constraints, it is 
not the maximal entropy one, such that it is biased in some sense. However, 
this drawback can be made more tangible: the model represents a
probability distribution only if $\max_{i,j} d_i d_j \le s$, which is by no means true in all
practical applications, in particular in power-law graphs. This shortcoming
is a symptom of a bias of this model: it disproportionately favours
connections between pairs of nodes both of high degree, such that for nodes of too
high degrees the edge 'probability' suggested becomes larger than 1. A brief
remark concerning a similar model for binary databases was made in [5],
where it was dismissed by the authors on similar grounds. 
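
The failure mode is easy to reproduce numerically. The sketch below evaluates the largest edge 'probability' that the product model assigns to a pair of distinct nodes; the function name and the example degree sequence are hypothetical.

```python
def product_model_max_probability(degrees):
    """Largest value of d_i * d_j / s over pairs of distinct nodes, with s the sum of all degrees."""
    s = float(sum(degrees))
    largest, second = sorted(degrees, reverse=True)[:2]
    return largest * second / s


# Two hubs of degree 10 among twenty nodes of degree 1: s = 40.
value = product_model_max_probability([10, 10] + [1] * 20)  # 10 * 10 / 40 = 2.5
```

Here the two hubs would have to be connected with 'probability' 2.5, so the product model does not define a valid distribution for this degree sequence.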

5.2 Relevance to data mining 

The implications of the ability to construct explicit models for databases and 
networks are vast. It suffices to look at an area of data mining that has been 
using data models for a long time in the search for patterns: string analysis. 

In the search for patterns in strings, it is common to adopt Markov models 
of an appropriate order as a background model. Then, patterns that somehow
depart from what is expected under this background model are reported as
potentially interesting. For example, a comprehensive survey about motif
discovery algorithms lists various algorithms that model DNA sequences 
using Markov models, after which motifs are scored based on various in- 
formation theoretic or statistical measures, depending on the method. One 
notable example is MotifSampler [17]. Certainly, randomization strategies 
have been and continue to be used successfully for testing the significance 
of string patterns. However, it should be emphasized that the methods for 
discovering string patterns that contrast with a background model have be- 
come possible only thanks to the availability of that background model in an 
explicit form. 

Therefore, we believe it is to be expected that the availability of explicit
models for databases and networks will enable the introduction of new
measures of interestingness for patterns found in such data, be they recurring
itemsets or network motifs, other local patterns, or more global patterns 
such as correlations between items or transactions, the clustering coefficient 
of a network, and so on. 

We also wish to stress that the MaxEnt modelling strategy is applicable
to other types of constraints as well. For example, it would be possible
to add constraints for the support of specific itemsets, or the presence of 
certain cliques in a network, etc. Such modifications would be in line with 
the alternative randomization strategies suggested in [B]. It is also possible to 
allow matrix elements to become negative, if further constraints for example 
on the variance are introduced (the resulting MaxEnt distribution would 
then be a product of normal distributions). Furthermore, the strategy can 
be applied to other data types as well. 

6 Conclusions 

In recent years, a significant amount of data mining research has been devoted 
to the statistical assessment of data mining results. The strategies used for 
this purpose have often been based on randomization testing and related 
ideas. The use of randomization strategies was motivated by the lack of 
explicit models for the data. 

We have shown that in a wide variety of cases it is possible to construct 
explicitly represented distributions for the data. The modelling approach we 
have suggested is based on the maximum entropy principle. We have illus- 
trated that fitting maximum entropy distributions often boils down to well- 
posed convex optimization problems. In the experiments, we have demon- 
strated how the MaxEnt model can be fitted extremely efficiently to large 
databases and networks in a matter of seconds on a basic laptop computer. 

This newfound ability holds several promises for the domain of data min- 
ing. Most trivially, the assessment of data mining results may be facilitated 
in certain cases, as sampling random data instances from an explicit model 
may be easier than using MCMC sampling techniques from a model defined 
in terms of invariants of the distribution. More importantly, we believe that 
our results may spark the creation of new measures of informativeness for 
patterns discovered from data, in particular for databases and for networks. 